1 Introduction

1.1 What is Bowtie 2?

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.

1.2 What is BWA?

BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the support of long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate.

3 Tutorial

In this tutorial we are going to explore how to align short reads to a reference genome. Two of the most popular short read aligners used in bioinformatics are the Bowtie 2 and BWA aligners. Both of these aligners use the Burrows-Wheeler transform to build an index of the reference genome, however they differ from each other in respect of the alignment algorithm. It is important to say up front the alignment problem has not been solved - researchers continue to working on optimizing the efficiency of the read alignment procedure and strategies for determining the position of multi-mapping reads. With that said, lets get to the tutorial. The first thing to do is create a directory to store all the tutorial output files and data.

Create a working directory for this tutorial:

bash
mkdir tutorial

3.1 Software installation

Create a new environment called ‘alignment’ with all the required software installed:

bash
conda create --yes --name alignment bowtie2 bwa entrez-direct samtools seqkit sra-tools=2.11.0  # only this version works currently (21/09/2022) 
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
## Collecting package metadata (repodata.json): ...working... done
## Solving environment: ...working... done
## 
## ## Package Plan ##
## 
##   environment location: /opt/miniconda3/envs/alignment
## 
##   added / updated specs:
##     - bowtie2
##     - bwa
##     - entrez-direct
##     - samtools
##     - seqkit
##     - sra-tools==2.11.0=pl5262h37d2149_1
## 
## 
## The following packages will be downloaded:
## 
##     package                    |            build
##     ---------------------------|-----------------
##     tbb-2021.5.0               |       hb8565cd_5         154 KB  conda-forge
##     ------------------------------------------------------------
##                                            Total:         154 KB
## 
## The following NEW packages will be INSTALLED:
## 
##   bowtie2            bioconda/osx-64::bowtie2-2.4.5-py310h7d4de36_4
##   bwa                bioconda/osx-64::bwa-0.7.17-h1f540d2_9
##   bzip2              conda-forge/osx-64::bzip2-1.0.8-h0d85af4_4
##   c-ares             conda-forge/osx-64::c-ares-1.18.1-h0d85af4_0
##   ca-certificates    conda-forge/osx-64::ca-certificates-2022.9.14-h033912b_0
##   curl               conda-forge/osx-64::curl-7.83.1-h23f1065_0
##   entrez-direct      bioconda/osx-64::entrez-direct-16.2-h193322a_1
##   gettext            conda-forge/osx-64::gettext-0.19.8.1-hd1a6beb_1008
##   hdf5               conda-forge/osx-64::hdf5-1.10.6-nompi_haae91d6_101
##   htslib             bioconda/osx-64::htslib-1.16-h567f53e_0
##   icu                conda-forge/osx-64::icu-70.1-h96cf925_0
##   krb5               conda-forge/osx-64::krb5-1.19.3-hb98e516_0
##   libcurl            conda-forge/osx-64::libcurl-7.83.1-h23f1065_0
##   libcxx             conda-forge/osx-64::libcxx-14.0.6-hccf4f1f_0
##   libdeflate         conda-forge/osx-64::libdeflate-1.13-h775f41a_0
##   libedit            conda-forge/osx-64::libedit-3.1.20191231-h0678c8f_2
##   libev              conda-forge/osx-64::libev-4.33-haf1e3a3_1
##   libffi             conda-forge/osx-64::libffi-3.4.2-h0d85af4_5
##   libgfortran        conda-forge/osx-64::libgfortran-4.0.0-7_5_0_h1a10cd1_23
##   libgfortran4       conda-forge/osx-64::libgfortran4-7.5.0-h1a10cd1_23
##   libiconv           conda-forge/osx-64::libiconv-1.16-haf1e3a3_0
##   libidn2            conda-forge/osx-64::libidn2-2.3.3-hac89ed1_0
##   libnghttp2         conda-forge/osx-64::libnghttp2-1.47.0-h5aae05b_1
##   libsqlite          conda-forge/osx-64::libsqlite-3.39.3-ha978bb4_0
##   libssh2            conda-forge/osx-64::libssh2-1.10.0-h47af595_3
##   libunistring       conda-forge/osx-64::libunistring-0.9.10-h0d85af4_0
##   libxml2            conda-forge/osx-64::libxml2-2.9.14-hea49891_4
##   libzlib            conda-forge/osx-64::libzlib-1.2.12-hfd90126_3
##   llvm-openmp        conda-forge/osx-64::llvm-openmp-14.0.4-ha654fa7_0
##   ncbi-ngs-sdk       bioconda/osx-64::ncbi-ngs-sdk-2.11.2-h247ad82_0
##   ncurses            conda-forge/osx-64::ncurses-6.3-h96cf925_1
##   openssl            conda-forge/osx-64::openssl-3.0.5-hfd90126_2
##   ossuuid            conda-forge/osx-64::ossuuid-1.6.2-h0a44026_1000
##   perl               conda-forge/osx-64::perl-5.26.2-hbcb3906_1008
##   perl-uri           bioconda/osx-64::perl-uri-1.71-pl526_3
##   perl-xml-libxml    bioconda/osx-64::perl-xml-libxml-2.0132-pl526h08abf6f_1
##   perl-xml-namespac~ bioconda/osx-64::perl-xml-namespacesupport-1.11-pl526_1
##   perl-xml-sax       bioconda/osx-64::perl-xml-sax-0.99-pl526_1
##   perl-xml-sax-base  bioconda/osx-64::perl-xml-sax-base-1.09-pl526_0
##   pip                conda-forge/noarch::pip-22.2.2-pyhd8ed1ab_0
##   python             conda-forge/osx-64::python-3.10.6-hc14f532_0_cpython
##   python_abi         conda-forge/osx-64::python_abi-3.10-2_cp310
##   readline           conda-forge/osx-64::readline-8.1.2-h3899abd_0
##   samtools           bioconda/osx-64::samtools-1.15.1-h7e39424_1
##   seqkit             bioconda/osx-64::seqkit-2.3.1-h527b516_0
##   setuptools         conda-forge/noarch::setuptools-65.3.0-pyhd8ed1ab_1
##   sra-tools          bioconda/osx-64::sra-tools-2.11.0-pl5262h37d2149_1
##   tbb                conda-forge/osx-64::tbb-2021.5.0-hb8565cd_5
##   tk                 conda-forge/osx-64::tk-8.6.12-h5dbffcc_0
##   tzdata             conda-forge/noarch::tzdata-2022c-h191b570_0
##   wget               conda-forge/osx-64::wget-1.20.3-hd3787cc_1
##   wheel              conda-forge/noarch::wheel-0.37.1-pyhd8ed1ab_0
##   xz                 conda-forge/osx-64::xz-5.2.6-h775f41a_0
##   zlib               conda-forge/osx-64::zlib-1.2.12-hfd90126_3
##   zstd               conda-forge/osx-64::zstd-1.5.2-hfa58983_4
## 
## 
## 
## Downloading and Extracting Packages
## 
tbb-2021.5.0         | 154 KB    |            |   0% 
tbb-2021.5.0         | 154 KB    | #          |  10% 
tbb-2021.5.0         | 154 KB    | ####1      |  41% 
tbb-2021.5.0         | 154 KB    | ######2    |  62% 
tbb-2021.5.0         | 154 KB    | ########## | 100% 
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## #     $ conda activate alignment
## #
## # To deactivate an active environment, use
## #
## #     $ conda deactivate
## 
## Retrieving notices: ...working... done

Activate the ‘alignment’ environment:

bash
conda activate alignment

Test that Bowtie 2 and BWA are available:

bash
which bowtie2
which bwa
## /opt/miniconda3/envs/alignment/bin/bowtie2
## /opt/miniconda3/envs/alignment/bin/bwa

3.2 Data download

The data we are going to use in this tutorial is from the original research paper which first reported the genome sequence of the SARS-CoV-2 virus responsible for the COVID-19 pandemic across the globe. The paper is listed below for you to read in your own time, if you are so inclined:

A new coronavirus associated with human respiratory disease in China

Within the paper there is a ‘Data availability’ section telling you where to download the sequencing data. To save you some time, I can tell you that the sequencing data was deposited in the SRA database under the study accession number SRP245409. Have a look through the study page to get familiar with the data, ask yourself questions like ‘how many runs are in this study?’, ‘what sequencing machine was used?’, ‘how many reads did they sequence?’, etc.

Output

As well as the sequencing data, we also need to download a copy of the SARS-CoV-2 reference genome. The NCBI hosts the reference genome in the GenBank database under the accession number MN908947. Again, get yourself acquainted with the data by exploring the GenBank page for the genome, ask yourself questions like ‘who submitted the genome sequence?’, ‘how long is the genome?’, ‘how many genes do it contain?’, etc.

Output

Now that you’ve had a good rummage around, lets get to business and download the data. First, create a directory to actually store the sequencing data and reference genome.

bash
mkdir tutorial/data

Although we discussed the advantages of short read aligners in the context of large genomes. This tutorial would take much longer if we went down that route. Instead, small viral genomes such as SARS-CoV-2 are an excellent substitute for teaching. Use the efetch command from EDirect to download the reference genome:

bash
efetch -db nuccore -id MN908947 -format fasta > tutorial/data/MN908947.fasta

To save time and memory requirements, we are only going to use a small sample of the sequencing data. Use the fastq-dump command from the SRA Toolkit to download just 1,000 of the paired-end reads from the SARS-CoV-2 sequencing data:

bash
fastq-dump -X 1000 -O tutorial/data --split-files SRR15168839
## Read 1000 spots for SRR15168839
## Written 1000 spots for SRR15168839

Now that we have the reads and the reference genome, lets start aligning!

3.3 Bowtie alignment

The first aligner we’re going to use is Bowtie 2 - the second major version of the Bowtie alignment software. The general procedure for read alignment is to first index the reference genome sequence, then align the reads to the genome sequence with the help of the index to speed up mapping. I should warn you now, aligners have a huge number of settings. On the one hand this is great because it means they can be customized for very specific use cases, on the other hand all those settings can be quite bewildering for someone inexperienced. Luckily for you, unless you have a particularly abnormal data set - the defaults work quite well! In fact, Bowtie 2 comes with has handy presets you can choose to automatically set multiple parameters, they even have helpful names like ‘–very-fast’ or ‘–very-sensitive’.

Lets begin by creating a directory to store all the Bowtie 2 output files:

bash
mkdir tutorial/bowtie2

To index a reference sequence, you need to use the bowtie2-build command:

bash
bowtie2-build tutorial/data/MN908947.fasta tutorial/bowtie2/MN908947
## Settings:
##   Output files: "tutorial/bowtie2/MN908947.*.bt2"
##   Line rate: 6 (line is 64 bytes)
##   Lines per side: 1 (side is 64 bytes)
##   Offset rate: 4 (one in 16)
##   FTable chars: 10
##   Strings: unpacked
##   Max bucket size: default
##   Max bucket size, sqrt multiplier: default
##   Max bucket size, len divisor: 4
##   Difference-cover sample period: 1024
##   Endianness: little
##   Actual local endianness: little
##   Sanity checking: disabled
##   Assertions: disabled
##   Random seed: 0
##   Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
##   tutorial/data/MN908947.fasta
## Building a SMALL index
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 7475
## Using parameters --bmax 5607 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 5607 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Avg bucket size: 3737 (target: 5606)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
##   Reserving size (5607) for bucket 1
##   Calculating Z arrays for bucket 1
##   Entering block accumulator loop for bucket 1:
##   bucket 1: 10%
##   bucket 1: 20%
##   bucket 1: 30%
##   bucket 1: 40%
##   bucket 1: 50%
##   bucket 1: 60%
##   bucket 1: 70%
##   bucket 1: 80%
##   bucket 1: 90%
##   bucket 1: 100%
##   Sorting block of length 4574 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 4575 for bucket 1
## Getting block 2 of 8
##   Reserving size (5607) for bucket 2
##   Calculating Z arrays for bucket 2
##   Entering block accumulator loop for bucket 2:
##   bucket 2: 10%
##   bucket 2: 20%
##   bucket 2: 30%
##   bucket 2: 40%
##   bucket 2: 50%
##   bucket 2: 60%
##   bucket 2: 70%
##   bucket 2: 80%
##   bucket 2: 90%
##   bucket 2: 100%
##   Sorting block of length 2668 for bucket 2
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 2669 for bucket 2
## Getting block 3 of 8
##   Reserving size (5607) for bucket 3
##   Calculating Z arrays for bucket 3
##   Entering block accumulator loop for bucket 3:
##   bucket 3: 10%
##   bucket 3: 20%
##   bucket 3: 30%
##   bucket 3: 40%
##   bucket 3: 50%
##   bucket 3: 60%
##   bucket 3: 70%
##   bucket 3: 80%
##   bucket 3: 90%
##   bucket 3: 100%
##   Sorting block of length 3483 for bucket 3
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 3484 for bucket 3
## Getting block 4 of 8
##   Reserving size (5607) for bucket 4
##   Calculating Z arrays for bucket 4
##   Entering block accumulator loop for bucket 4:
##   bucket 4: 10%
##   bucket 4: 20%
##   bucket 4: 30%
##   bucket 4: 40%
##   bucket 4: 50%
##   bucket 4: 60%
##   bucket 4: 70%
##   bucket 4: 80%
##   bucket 4: 90%
##   bucket 4: 100%
##   Sorting block of length 4420 for bucket 4
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 4421 for bucket 4
## Getting block 5 of 8
##   Reserving size (5607) for bucket 5
##   Calculating Z arrays for bucket 5
##   Entering block accumulator loop for bucket 5:
##   bucket 5: 10%
##   bucket 5: 20%
##   bucket 5: 30%
##   bucket 5: 40%
##   bucket 5: 50%
##   bucket 5: 60%
##   bucket 5: 70%
##   bucket 5: 80%
##   bucket 5: 90%
##   bucket 5: 100%
##   Sorting block of length 4901 for bucket 5
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 4902 for bucket 5
## Getting block 6 of 8
##   Reserving size (5607) for bucket 6
##   Calculating Z arrays for bucket 6
##   Entering block accumulator loop for bucket 6:
##   bucket 6: 10%
##   bucket 6: 20%
##   bucket 6: 30%
##   bucket 6: 40%
##   bucket 6: 50%
##   bucket 6: 60%
##   bucket 6: 70%
##   bucket 6: 80%
##   bucket 6: 90%
##   bucket 6: 100%
##   Sorting block of length 3813 for bucket 6
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 3814 for bucket 6
## Getting block 7 of 8
##   Reserving size (5607) for bucket 7
##   Calculating Z arrays for bucket 7
##   Entering block accumulator loop for bucket 7:
##   bucket 7: 10%
##   bucket 7: 20%
##   bucket 7: 30%
##   bucket 7: 40%
##   bucket 7: 50%
##   bucket 7: 60%
##   bucket 7: 70%
##   bucket 7: 80%
##   bucket 7: 90%
##   bucket 7: 100%
##   Sorting block of length 2667 for bucket 7
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 2668 for bucket 7
## Getting block 8 of 8
##   Reserving size (5607) for bucket 8
##   Calculating Z arrays for bucket 8
##   Entering block accumulator loop for bucket 8:
##   bucket 8: 10%
##   bucket 8: 20%
##   bucket 8: 30%
##   bucket 8: 40%
##   bucket 8: 50%
##   bucket 8: 60%
##   bucket 8: 70%
##   bucket 8: 80%
##   bucket 8: 90%
##   bucket 8: 100%
##   Sorting block of length 3370 for bucket 8
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 3371 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 8954
## fchr[G]: 14446
## fchr[T]: 20309
## fchr[$]: 29903
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4204544 bytes to primary EBWT file: tutorial/bowtie2/MN908947.1.bt2.tmp
## Wrote 7480 bytes to secondary EBWT file: tutorial/bowtie2/MN908947.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 29903
##     bwtLen: 29904
##     sz: 7476
##     bwtSz: 7476
##     lineRate: 6
##     offRate: 4
##     offMask: 0xfffffff0
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 1869
##     offsSz: 7476
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 48
##     sideBwtLen: 192
##     numSides: 156
##     numLines: 156
##     ebwtTotLen: 9984
##     ebwtTotSz: 9984
##     color: 0
##     reverse: 0
## Total time for call to driver() for forward index: 00:00:00
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
##   Time to reverse reference sequence: 00:00:00
## bmax according to bmaxDivN setting: 7475
## Using parameters --bmax 5607 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 5607 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 6; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Avg bucket size: 3737 (target: 5606)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
##   Reserving size (5607) for bucket 1
##   Calculating Z arrays for bucket 1
##   Entering block accumulator loop for bucket 1:
##   bucket 1: 10%
##   bucket 1: 20%
##   bucket 1: 30%
##   bucket 1: 40%
##   bucket 1: 50%
##   bucket 1: 60%
##   bucket 1: 70%
##   bucket 1: 80%
##   bucket 1: 90%
##   bucket 1: 100%
##   Sorting block of length 2598 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 2599 for bucket 1
## Getting block 2 of 8
##   Reserving size (5607) for bucket 2
##   Calculating Z arrays for bucket 2
##   Entering block accumulator loop for bucket 2:
##   bucket 2: 10%
##   bucket 2: 20%
##   bucket 2: 30%
##   bucket 2: 40%
##   bucket 2: 50%
##   bucket 2: 60%
##   bucket 2: 70%
##   bucket 2: 80%
##   bucket 2: 90%
##   bucket 2: 100%
##   Sorting block of length 5142 for bucket 2
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 5143 for bucket 2
## Getting block 3 of 8
##   Reserving size (5607) for bucket 3
##   Calculating Z arrays for bucket 3
##   Entering block accumulator loop for bucket 3:
##   bucket 3: 10%
##   bucket 3: 20%
##   bucket 3: 30%
##   bucket 3: 40%
##   bucket 3: 50%
##   bucket 3: 60%
##   bucket 3: 70%
##   bucket 3: 80%
##   bucket 3: 90%
##   bucket 3: 100%
##   Sorting block of length 4371 for bucket 3
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 4372 for bucket 3
## Getting block 4 of 8
##   Reserving size (5607) for bucket 4
##   Calculating Z arrays for bucket 4
##   Entering block accumulator loop for bucket 4:
##   bucket 4: 10%
##   bucket 4: 20%
##   bucket 4: 30%
##   bucket 4: 40%
##   bucket 4: 50%
##   bucket 4: 60%
##   bucket 4: 70%
##   bucket 4: 80%
##   bucket 4: 90%
##   bucket 4: 100%
##   Sorting block of length 4433 for bucket 4
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 4434 for bucket 4
## Getting block 5 of 8
##   Reserving size (5607) for bucket 5
##   Calculating Z arrays for bucket 5
##   Entering block accumulator loop for bucket 5:
##   bucket 5: 10%
##   bucket 5: 20%
##   bucket 5: 30%
##   bucket 5: 40%
##   bucket 5: 50%
##   bucket 5: 60%
##   bucket 5: 70%
##   bucket 5: 80%
##   bucket 5: 90%
##   bucket 5: 100%
##   Sorting block of length 3267 for bucket 5
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 3268 for bucket 5
## Getting block 6 of 8
##   Reserving size (5607) for bucket 6
##   Calculating Z arrays for bucket 6
##   Entering block accumulator loop for bucket 6:
##   bucket 6: 10%
##   bucket 6: 20%
##   bucket 6: 30%
##   bucket 6: 40%
##   bucket 6: 50%
##   bucket 6: 60%
##   bucket 6: 70%
##   bucket 6: 80%
##   bucket 6: 90%
##   bucket 6: 100%
##   Sorting block of length 2880 for bucket 6
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 2881 for bucket 6
## Getting block 7 of 8
##   Reserving size (5607) for bucket 7
##   Calculating Z arrays for bucket 7
##   Entering block accumulator loop for bucket 7:
##   bucket 7: 10%
##   bucket 7: 20%
##   bucket 7: 30%
##   bucket 7: 40%
##   bucket 7: 50%
##   bucket 7: 60%
##   bucket 7: 70%
##   bucket 7: 80%
##   bucket 7: 90%
##   bucket 7: 100%
##   Sorting block of length 5175 for bucket 7
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 5176 for bucket 7
## Getting block 8 of 8
##   Reserving size (5607) for bucket 8
##   Calculating Z arrays for bucket 8
##   Entering block accumulator loop for bucket 8:
##   bucket 8: 10%
##   bucket 8: 20%
##   bucket 8: 30%
##   bucket 8: 40%
##   bucket 8: 50%
##   bucket 8: 60%
##   bucket 8: 70%
##   bucket 8: 80%
##   bucket 8: 90%
##   bucket 8: 100%
##   Sorting block of length 2030 for bucket 8
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 2031 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 8954
## fchr[G]: 14446
## fchr[T]: 20309
## fchr[$]: 29903
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4204544 bytes to primary EBWT file: tutorial/bowtie2/MN908947.rev.1.bt2.tmp
## Wrote 7480 bytes to secondary EBWT file: tutorial/bowtie2/MN908947.rev.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 29903
##     bwtLen: 29904
##     sz: 7476
##     bwtSz: 7476
##     lineRate: 6
##     offRate: 4
##     offMask: 0xfffffff0
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 1869
##     offsSz: 7476
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 48
##     sideBwtLen: 192
##     numSides: 156
##     numLines: 156
##     ebwtTotLen: 9984
##     ebwtTotSz: 9984
##     color: 0
##     reverse: 1
## Total time for backward call to driver() for mirror index: 00:00:00
## Renaming tutorial/bowtie2/MN908947.3.bt2.tmp to tutorial/bowtie2/MN908947.3.bt2
## Renaming tutorial/bowtie2/MN908947.4.bt2.tmp to tutorial/bowtie2/MN908947.4.bt2
## Renaming tutorial/bowtie2/MN908947.1.bt2.tmp to tutorial/bowtie2/MN908947.1.bt2
## Renaming tutorial/bowtie2/MN908947.2.bt2.tmp to tutorial/bowtie2/MN908947.2.bt2
## Renaming tutorial/bowtie2/MN908947.rev.1.bt2.tmp to tutorial/bowtie2/MN908947.rev.1.bt2
## Renaming tutorial/bowtie2/MN908947.rev.2.bt2.tmp to tutorial/bowtie2/MN908947.rev.2.bt2

The command should print many lines of output then quit. When the command completes, the tutorial/bowtie2 directory will contain six new files that all start with MN908947 and end with .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. These files constitute the index for the Bowtie 2 aligner. The next step is to align the reads. Remember what I said about the aligner having lots of settings? Now would be a good time to have a quick look at all those settings. I know you’re excited to get to the alignment step, but understanding the capabilities of the aligner is important. Go ahead and display the usage information for the bowtie2 command:

bash
bowtie2 -h
## Bowtie 2 version 2.4.5 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
## Usage: 
##   bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]
## 
##   <bt2-idx>  Index filename prefix (minus trailing .X.bt2).
##              NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
##   <m1>       Files with #1 mates, paired with files in <m2>.
##              Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
##   <m2>       Files with #2 mates, paired with files in <m1>.
##              Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
##   <r>        Files with unpaired reads.
##              Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
##   <i>        Files with interleaved paired-end FASTQ/FASTA reads
##              Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
##   <bam>      Files are unaligned BAM sorted by read name.
##   <sam>      File for SAM output (default: stdout)
## 
##   <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
##   specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.
## 
## Options (defaults in parentheses):
## 
##  Input:
##   -q                 query input files are FASTQ .fq/.fastq (default)
##   --tab5             query input files are TAB5 .tab5
##   --tab6             query input files are TAB6 .tab6
##   --qseq             query input files are in Illumina's qseq format
##   -f                 query input files are (multi-)FASTA .fa/.mfa
##   -r                 query input files are raw one-sequence-per-line
##   -F k:<int>,i:<int> query input files are continuous FASTA where reads
##                      are substrings (k-mers) extracted from a FASTA file <s>
##                      and aligned at offsets 1, 1+i, 1+2i ... end of reference
##   -c                 <m1>, <m2>, <r> are sequences themselves, not files
##   -s/--skip <int>    skip the first <int> reads/pairs in the input (none)
##   -u/--upto <int>    stop after first <int> reads/pairs (no limit)
##   -5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)
##   -3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)
##   --trim-to [3:|5:]<int> trim reads exceeding <int> bases from either 3' or 5' end
##                      If the read end is not specified then it defaults to 3 (0)
##   --phred33          qualities are Phred+33 (default)
##   --phred64          qualities are Phred+64
##   --int-quals        qualities encoded as space-delimited integers
## 
##  Presets:                 Same as:
##   For --end-to-end:
##    --very-fast            -D 5 -R 1 -N 0 -L 22 -i S,0,2.50
##    --fast                 -D 10 -R 2 -N 0 -L 22 -i S,0,2.50
##    --sensitive            -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)
##    --very-sensitive       -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
## 
##   For --local:
##    --very-fast-local      -D 5 -R 1 -N 0 -L 25 -i S,1,2.00
##    --fast-local           -D 10 -R 2 -N 0 -L 22 -i S,1,1.75
##    --sensitive-local      -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)
##    --very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
## 
##  Alignment:
##   -N <int>           max # mismatches in seed alignment; can be 0 or 1 (0)
##   -L <int>           length of seed substrings; must be >3, <32 (22)
##   -i <func>          interval between seed substrings w/r/t read len (S,1,1.15)
##   --n-ceil <func>    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
##   --dpad <int>       include <int> extra ref chars on sides of DP table (15)
##   --gbar <int>       disallow gaps within <int> nucs of read extremes (4)
##   --ignore-quals     treat all quality values as 30 on Phred scale (off)
##   --nofw             do not align forward (original) version of read (off)
##   --norc             do not align reverse-complement version of read (off)
##   --no-1mm-upfront   do not allow 1 mismatch alignments before attempting to
##                      scan for the optimal seeded alignments
##   --end-to-end       entire read must align; no clipping (on)
##    OR
##   --local            local alignment; ends might be soft clipped (off)
## 
##  Scoring:
##   --ma <int>         match bonus (0 for --end-to-end, 2 for --local) 
##   --mp <int>         max penalty for mismatch; lower qual = lower penalty (6)
##   --np <int>         penalty for non-A/C/G/Ts in read/ref (1)
##   --rdg <int>,<int>  read gap open, extend penalties (5,3)
##   --rfg <int>,<int>  reference gap open, extend penalties (5,3)
##   --score-min <func> min acceptable alignment score w/r/t read length
##                      (G,20,8 for local, L,-0.6,-0.6 for end-to-end)
## 
##  Reporting:
##   (default)          look for multiple alignments, report best, with MAPQ
##    OR
##   -k <int>           report up to <int> alns per read; MAPQ not meaningful
##    OR
##   -a/--all           report all alignments; very slow, MAPQ not meaningful
## 
##  Effort:
##   -D <int>           give up extending after <int> failed extends in a row (15)
##   -R <int>           for reads w/ repetitive seeds, try <int> sets of seeds (2)
## 
##  Paired-end:
##   -I/--minins <int>  minimum fragment length (0)
##   -X/--maxins <int>  maximum fragment length (500)
##   --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
##   --no-mixed         suppress unpaired alignments for paired reads
##   --no-discordant    suppress discordant alignments for paired reads
##   --dovetail         concordant when mates extend past each other
##   --no-contain       not concordant when one mate alignment contains other
##   --no-overlap       not concordant when mates overlap at all
## 
##  BAM:
##   --align-paired-reads
##                      Bowtie2 will, by default, attempt to align unpaired BAM reads.
##                      Use this option to align paired-end reads instead.
##   --preserve-tags    Preserve tags from the original BAM record by
##                      appending them to the end of the corresponding SAM output.
## 
##  Output:
##   -t/--time          print wall-clock time taken by search phases
##   --un <path>        write unpaired reads that didn't align to <path>
##   --al <path>        write unpaired reads that aligned at least once to <path>
##   --un-conc <path>   write pairs that didn't align concordantly to <path>
##   --al-conc <path>   write pairs that aligned concordantly at least once to <path>
##     (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
##     --un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
##   --quiet            print nothing to stderr except serious errors
##   --met-file <path>  send metrics to file at <path> (off)
##   --met-stderr       send metrics to stderr (off)
##   --met <int>        report internal counters & metrics every <int> secs (1)
##   --no-unal          suppress SAM records for unaligned reads
##   --no-head          suppress header lines, i.e. lines starting with @
##   --no-sq            suppress @SQ header lines
##   --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field
##   --rg <text>        add <text> ("lab:value") to @RG line of SAM header.
##                      Note: @RG line only printed when --rg-id is set.
##   --omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments.
##   --sam-no-qname-trunc
##                      Suppress standard behavior of truncating readname at first whitespace 
##                      at the expense of generating non-standard SAM.
##   --xeq              Use '='/'X', instead of 'M,' to specify matches/mismatches in SAM record.
##   --soft-clipped-unmapped-tlen
##                      Exclude soft-clipped bases when reporting TLEN
##   --sam-append-comment
##                      Append FASTA/FASTQ comment to SAM record
## 
##  Performance:
##   -p/--threads <int> number of alignment threads to launch (1)
##   --reorder          force SAM output order to match order of input reads
##   --mm               use memory-mapped I/O for index; many 'bowtie's can share
## 
##  Other:
##   --qc-filter        filter out reads that are bad according to QSEQ filter
##   --seed <int>       seed for random number generator (0)
##   --non-deterministic
##                      seed rand. gen. arbitrarily instead of using read attributes
##   --version          print version information and quit
##   -h/--help          print this usage message

Now that you’ve skimmed the usage information, lets run the aligner! We need to provide the index of the reference genome, the FASTQ files containing the sequencing reads, and also the name of file to write the results:

bash
bowtie2 -x tutorial/bowtie2/MN908947 -1 tutorial/data/SRR15168839_1.fastq -2 tutorial/data/SRR15168839_2.fastq > tutorial/bowtie2/SRR15168839.sam
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     910 (91.00%) aligned concordantly 0 times
##     90 (9.00%) aligned concordantly exactly 1 time
##     0 (0.00%) aligned concordantly >1 times
##     ----
##     910 pairs aligned concordantly 0 times; of these:
##       870 (95.60%) aligned discordantly 1 time
##     ----
##     40 pairs aligned 0 times concordantly or discordantly; of these:
##       80 mates make up the pairs; of these:
##         79 (98.75%) aligned 0 times
##         1 (1.25%) aligned exactly 1 time
##         0 (0.00%) aligned >1 times
## 96.05% overall alignment rate

This runs the Bowtie 2 aligner, which aligns a set of paired-end reads to the SARS-CoV-2 reference genome using the index generated in the previous step. The alignment results are written in SAM format to the file tutorial/bowtie2/SRR15168839.sam, and a short alignment summary is written to the console.

The summary provides some useful information on the mapping rate. In the example above, it tells you that of the 1000 paired-end reads, only 9% of them aligned ‘concordantly’. This term is used to label read pairs which align with the expected insert size in the expected orientation. For more details, see the section of the manual which describes the difference between concordant and discordant pairs. Of those reads which aligned ‘discordantly’, nearly 96% of them were aligned. Finally, the overall alignment rate for this data is 96%.

The SAM file format is the de-facto standard for reporting of alignments to a reference genome. These files can be manipulated using the SAMtools software package. We will learn more about SAMtools in the next workshop, so for now just follow the commands listed below.

Print SAM header only:

bash
samtools view -H tutorial/bowtie2/SRR15168839.sam
## @HD  VN:1.5  SO:unsorted GO:query
## @SQ  SN:MN908947.3   LN:29903
## @PG  ID:bowtie2  PN:bowtie2  VN:2.4.5    CL:"/opt/miniconda3/envs/alignment/bin/bowtie2-align-s --wrapper basic-0 -x tutorial/bowtie2/MN908947 -1 tutorial/data/SRR15168839_1.fastq -2 tutorial/data/SRR15168839_2.fastq"
## @PG  ID:samtools PN:samtools PP:bowtie2  VN:1.15.1   CL:samtools view -H tutorial/bowtie2/SRR15168839.sam

Print SAM alignment records only:

bash
# Output the first few lines of the file
samtools view tutorial/bowtie2/SRR15168839.sam | head
## SRR15168839.1    97  MN908947.3  16442   23  142M4I5M    =   16435   -154    GCTATTATTGTAAATCACATAAACCGCCCATTAGTTTTCCATTGTGTGCTAATGGACAAGTTTTTGGTTTATATAAAAATACATGTGTTGGTAGCGATAATGTTACTGACTTTAATGCAATTGCAACATGTGACTGGACAAAGATCGGAAG ABBBBFFFFFFFFGGGGGGGGGHHHGGGGGHHHHHHHHHHHHHHGHHHHHHHHHFHEHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHAGHHEGHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGHHHHHHFHHHHHHHHHGGGD AS:i:-37    XN:i:0  XM:i:4  XO:i:1  XG:i:4  NM:i:8  MD:Z:25A116T1C0T1   YS:i:-39    YT:Z:DP
## SRR15168839.1    145 MN908947.3  16435   23  7M3I141M    =   16442   154 CTTCCGATCTGCTATTATTGTAAATCACATAAACCGCCCATTAGTTTTCCATTGTGTGCTAATGGACAAGTTTTTGGTTTATATAAAAATACATGTGTTGGTAGCGATAATGTTACTGACTTTAATGCAATTGCAACATGTGACTGGACAA ?GGGHHHHFGFHHHHHHHFHHGHGHHFHHHGGGGGHGHHHHHHHHHHHHHHEGHHHHHHHHHHHHHHHHGHHHHHHGHHHHHHHHHHHHHHHHHGGGGHHGGHGHHHHHHHHHHHHHHHHHHHHHGHHHGGGGGGGGGGFFFFFFFCCCBB AS:i:-39    XN:i:0  XM:i:5  XO:i:1  XG:i:3  NM:i:8  MD:Z:0G0G1A0T27A115 YS:i:-37    YT:Z:DP
## SRR15168839.2    97  MN908947.3  21305   42  151M    =   21304   -152    GCGAACAAATAGATGGTTATGTCATGCATGCAAATTACATATTTTGGAGGAATACAAACCCAATTCAGTTGTCTTCCTATTCTTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAAAGAAGGTCA CDDDDDDFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHGGHGGHHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGFGHHHHHHHHHHHIHHHHHHHHHGHHHHHH AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:58T92  YS:i:-10    YT:Z:DP
## SRR15168839.2    145 MN908947.3  21304   42  151M    =   21305   152 TGCGAACAAATAGATGGTTATGTCATGCATGCAAATTACATATTTTGGAGGAATACAAACCCAATTCAGTTGTCTTCCTATTCTTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAAAGAAGGTC GGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHIHIHHHHGHHHHGGGHHHHHHHHHHHGHHHHHHHHHHHGGGGGGGGGGFFFFFFFBBABA AS:i:-10    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:0C58T91    YS:i:-5 YT:Z:DP
## SRR15168839.3    99  MN908947.3  28222   42  49M1D102M   =   28285   214 AGTATCATGACGTTCGTGTTGTTTTAGATTTCATCTAAACGAACAAACTAAATGTCTCTAAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTG AAAAAFFFFFFCGGGGEFGGGGHHHHGHHHHHHHHHHHHHGGGGGHHHHHHHHHHHHHHHHHHHGHHHGGGGHHHHHHHGGGGHHHHHHGGGGGGGHHGHHHGGHGGHHHGHHHHHHHHHHHHHHHHGHHHHHHHHHGHHHHHHHGGGGGH AS:i:-23    XN:i:0  XM:i:3  XO:i:1  XG:i:1  NM:i:4  MD:Z:49^A8G0A0T91   YS:i:0  YT:Z:CP
## SRR15168839.3    147 MN908947.3  28285   42  151M    =   28222   -214    TGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACC HHGDGGGHGHHHGGGGHHHHHHGFCGGGGHHGHHHGHHHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHGGGGGHHHGGGGGGGGGHGGHHHHFGFEGGGGGGGHHHHHHGHHHGHGHHHGGGGGGGGGGFFBFBFFBBBBB AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:151    YS:i:-23    YT:Z:CP
## SRR15168839.4    97  MN908947.3  17888   42  151M    =   17887   -152    AAACAGCTCACTCTTGTAATGTAAACAGATTTAATGTTGCTATTACCAGAGCAAAAGTAGGCATACTTTGCATAATGTCTGATAGAGACCTTTATGACAAGTTGCAACTTACAAGTCTTGAAATTCCACGTAGGAATGTGGCAACTTTACA BBBBBFFFFFFFGGGGDGGGGGHHGHHHFHHHFHHHHHHHHHHHHHHHHHHHHHHGHGHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHGHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHGHFHHFHFHHHHHHHHHF AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:107T43 YS:i:-10    YT:Z:DP
## SRR15168839.4    145 MN908947.3  17887   42  151M    =   17888   152 TAAACAGCTCACTCTTGTAATGTAAACAGATTTAATGTTGCTATTACCAGAGCAAAAGTAGGCATACTTTGCATAATGTCTGATAGAGACCTTTATGACAAGTTGCAACTTACAAGTCTTGAAATTCCACGTAGGAATGTGGCAACTTTAC HHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHGHGHHHHHHIIIHHFHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGHGGGGGGGGGGFFFFFFFCCAAA AS:i:-10    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:0G107T42   YS:i:-5 YT:Z:DP
## SRR15168839.5    97  MN908947.3  28286   42  151M    =   28285   -152    GGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACCA BCCCCCCCCFCFGGGGGGGGGGHHHHGGGGGGGHHGHHHGGGFGHHHGHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHGHHHGGGGGHHHGGGGGGGGGHHHHHHHGHHGGGGGGGGHGGHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHG AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:150G0  YS:i:0  YT:Z:DP
## SRR15168839.5    145 MN908947.3  28285   42  151M    =   28286   152 TGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACC GHGGGGGHHHHHGGGGHHHHHHGFGGGGGHHGHHHGHHHHHHHHHGGGHHHHHHHHGHHHHHHHHHGHHHHHHHHHHHGGGGGHHHGGGGGGGGGHHGHHHHGHGGGGGGGGGHHHHHHHHHHGHHHHHGGGGGGGGGGFFDFCFFACBCA AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:151    YS:i:-5 YT:Z:DP
## samtools view: writing to standard output failed: Broken pipe
## samtools view: error closing standard output: -1

Calculate read alignment statistics:

bash
samtools flagstat tutorial/bowtie2/SRR15168839.sam
## 2000 + 0 in total (QC-passed reads + QC-failed reads)
## 2000 + 0 primary
## 0 + 0 secondary
## 0 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 1921 + 0 mapped (96.05% : N/A)
## 1921 + 0 primary mapped (96.05% : N/A)
## 2000 + 0 paired in sequencing
## 1000 + 0 read1
## 1000 + 0 read2
## 180 + 0 properly paired (9.00% : N/A)
## 1920 + 0 with itself and mate mapped
## 1 + 0 singletons (0.05% : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)

3.4 BWA alignment

The next aligner we’re going to use is BWA - which stands for Burrows-Wheeler aligner. No prizes for guessing which indexing strategy this aligner uses! As mentioned in the introduction, BWA provides three different alignment algorithms. In practice, most people use the BWA-MEM algorithm as it is generally faster and more accurate. The general procedure for BWA is very similar to Bowtie 2 so we won’t go into as much detail about the individual steps as we did previously.

Lets begin by creating a directory to store all the BWA output files:

bash
mkdir tutorial/bwa

As before, the next thing we need to do is index the reference sequence. To do this we use the bwa index command:

bash
bwa index -p tutorial/bwa/MN908947 tutorial/data/MN908947.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index -p tutorial/bwa/MN908947 tutorial/data/MN908947.fasta
## [main] Real time: 0.007 sec; CPU: 0.025 sec

The command should print some lines of output then quit. When the command completes, the tutorial/bwa directory will contain five new files that all start with MN908947 and end with .amb, .ann, .bwt, .pac, and .sa. This constitutes the index for the BWA aligner.

Display usage information for the bwa mem command:

bash
bwa mem
## 
## Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]
## 
## Algorithm options:
## 
##        -t INT        number of threads [1]
##        -k INT        minimum seed length [19]
##        -w INT        band width for banded alignment [100]
##        -d INT        off-diagonal X-dropoff [100]
##        -r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
##        -y INT        seed occurrence for the 3rd round seeding [20]
##        -c INT        skip seeds with more than INT occurrences [500]
##        -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
##        -W INT        discard a chain if seeded bases shorter than INT [0]
##        -m INT        perform at most INT rounds of mate rescues for each read [50]
##        -S            skip mate rescue
##        -P            skip pairing; mate rescue performed unless -S also in use
## 
## Scoring options:
## 
##        -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
##        -B INT        penalty for a mismatch [4]
##        -O INT[,INT]  gap open penalties for deletions and insertions [6,6]
##        -E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
##        -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5]
##        -U INT        penalty for an unpaired read pair [17]
## 
##        -x STR        read type. Setting -x changes multiple parameters unless overridden [null]
##                      pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0  (PacBio reads to ref)
##                      ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0  (Oxford Nanopore 2D-reads to ref)
##                      intractg: -B9 -O16 -L5  (intra-species contigs to ref)
## 
## Input/output options:
## 
##        -p            smart pairing (ignoring in2.fq)
##        -R STR        read group header line such as '@RG\tID:foo\tSM:bar' [null]
##        -H STR/FILE   insert STR to header if it starts with @; or insert lines in FILE [null]
##        -o FILE       sam file to output results to [stdout]
##        -j            treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
##        -5            for split alignment, take the alignment with the smallest coordinate as primary
##        -q            don't modify mapQ of supplementary alignments
##        -K INT        process INT input bases in each batch regardless of nThreads (for reproducibility) []
## 
##        -v INT        verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [3]
##        -T INT        minimum score to output [30]
##        -h INT[,INT]  if there are <INT hits with score >80% of the max score, output all in XA [5,200]
##        -a            output all alignments for SE or unpaired PE
##        -C            append FASTA/FASTQ comment to SAM output
##        -V            output the reference FASTA header in the XR tag
##        -Y            use soft clipping for supplementary alignments
##        -M            mark shorter split hits as secondary
## 
##        -I FLOAT[,FLOAT[,INT[,INT]]]
##                      specify the mean, standard deviation (10% of the mean if absent), max
##                      (4 sigma from the mean if absent) and min of the insert size distribution.
##                      FR orientation only. [inferred]
## 
## Note: Please read the man page for detailed description of the command line and options.

Run the BWA alignment command:

bash
bwa mem tutorial/bwa/MN908947 tutorial/data/SRR15168839_1.fastq tutorial/data/SRR15168839_2.fastq > tutorial/bwa/SRR15168839.sam
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 2000 sequences (302000 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 996, 0, 0)
## [M::mem_pestat] skip orientation FF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (142, 148, 149)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (128, 163)
## [M::mem_pestat] mean and std.dev: (145.46, 5.30)
## [M::mem_pestat] low and high boundaries for proper pairs: (121, 170)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] skip orientation RR as there are not enough pairs
## [M::mem_process_seqs] Processed 2000 reads in 0.049 CPU sec, 0.049 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem tutorial/bwa/MN908947 tutorial/data/SRR15168839_1.fastq tutorial/data/SRR15168839_2.fastq
## [main] Real time: 0.054 sec; CPU: 0.057 sec

Print SAM header only:

bash
samtools view -H tutorial/bwa/SRR15168839.sam
## @SQ  SN:MN908947.3   LN:29903
## @PG  ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem tutorial/bwa/MN908947 tutorial/data/SRR15168839_1.fastq tutorial/data/SRR15168839_2.fastq
## @PG  ID:samtools PN:samtools PP:bwa  VN:1.15.1   CL:samtools view -H tutorial/bwa/SRR15168839.sam

Print SAM alignment records only:

bash
# Output the first few lines of the file
samtools view tutorial/bwa/SRR15168839.sam | head
## SRR15168839.1    99  MN908947.3  16442   60  142M9S  =   16442   141 GCTATTATTGTAAATCACATAAACCGCCCATTAGTTTTCCATTGTGTGCTAATGGACAAGTTTTTGGTTTATATAAAAATACATGTGTTGGTAGCGATAATGTTACTGACTTTAATGCAATTGCAACATGTGACTGGACAAAGATCGGAAG ABBBBFFFFFFFFGGGGGGGGGHHHGGGGGHHHHHHHHHHHHHHGHHHHHHHHHFHEHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHAGHHEGHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGHHHHHHFHHHHHHHHHGGGD NM:i:1  MD:Z:25A116 MC:Z:10S141M    AS:i:137    XS:i:0
## SRR15168839.1    147 MN908947.3  16442   60  10S141M =   16442   -141    CTTCCGATCTGCTATTATTGTAAATCACATAAACCGCCCATTAGTTTTCCATTGTGTGCTAATGGACAAGTTTTTGGTTTATATAAAAATACATGTGTTGGTAGCGATAATGTTACTGACTTTAATGCAATTGCAACATGTGACTGGACAA ?GGGHHHHFGFHHHHHHHFHHGHGHHFHHHGGGGGHGHHHHHHHHHHHHHHEGHHHHHHHHHHHHHHHHGHHHHHHGHHHHHHHHHHHHHHHHHGGGGHHGGHGHHHHHHHHHHHHHHHHHHHHHGHHHGGGGGGGGGGFFFFFFFCCCBB NM:i:1  MD:Z:25A115 MC:Z:142M9S AS:i:136    XS:i:0
## SRR15168839.2    99  MN908947.3  21305   60  151M    =   21304   150 GCGAACAAATAGATGGTTATGTCATGCATGCAAATTACATATTTTGGAGGAATACAAACCCAATTCAGTTGTCTTCCTATTCTTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAAAGAAGGTCA CDDDDDDFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHGGHGGHHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGFGHHHHHHHHHHHIHHHHHHHHHGHHHHHH NM:i:1  MD:Z:58T92  MC:Z:151M   AS:i:146    XS:i:0
## SRR15168839.2    147 MN908947.3  21304   60  151M    =   21305   -150    TGCGAACAAATAGATGGTTATGTCATGCATGCAAATTACATATTTTGGAGGAATACAAACCCAATTCAGTTGTCTTCCTATTCTTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAAAGAAGGTC GGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHIHIHHHHGHHHHGGGHHHHHHHHHHHGHHHHHHHHHHHGGGGGGGGGGFFFFFFFBBABA NM:i:2  MD:Z:0C58T91    MC:Z:151M   AS:i:145    XS:i:0
## SRR15168839.3    97  MN908947.3  28222   60  49M1D102M   =   28285   214 AGTATCATGACGTTCGTGTTGTTTTAGATTTCATCTAAACGAACAAACTAAATGTCTCTAAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTG AAAAAFFFFFFCGGGGEFGGGGHHHHGHHHHHHHHHHHHHGGGGGHHHHHHHHHHHHHHHHHHHGHHHGGGGHHHHHHHGGGGHHHHHHGGGGGGGHHGHHHGGHGGHHHGHHHHHHHHHHHHHHHHGHHHHHHHHHGHHHHHHHGGGGGH NM:i:4  MD:Z:49^A8G0A0T91   MC:Z:151M   AS:i:129    XS:i:0
## SRR15168839.3    145 MN908947.3  28285   60  151M    =   28222   -214    TGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACC HHGDGGGHGHHHGGGGHHHHHHGFCGGGGHHGHHHGHHHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHGGGGGHHHGGGGGGGGGHGGHHHHFGFEGGGGGGGHHHHHHGHHHGHGHHHGGGGGGGGGGFFBFBFFBBBBB NM:i:0  MD:Z:151    MC:Z:49M1D102M  AS:i:151    XS:i:0
## SRR15168839.4    99  MN908947.3  17888   60  151M    =   17887   150 AAACAGCTCACTCTTGTAATGTAAACAGATTTAATGTTGCTATTACCAGAGCAAAAGTAGGCATACTTTGCATAATGTCTGATAGAGACCTTTATGACAAGTTGCAACTTACAAGTCTTGAAATTCCACGTAGGAATGTGGCAACTTTACA BBBBBFFFFFFFGGGGDGGGGGHHGHHHFHHHFHHHHHHHHHHHHHHHHHHHHHHGHGHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHGHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHGHFHHFHFHHHHHHHHHF NM:i:1  MD:Z:107T43 MC:Z:151M   AS:i:146    XS:i:0
## SRR15168839.4    147 MN908947.3  17887   60  151M    =   17888   -150    TAAACAGCTCACTCTTGTAATGTAAACAGATTTAATGTTGCTATTACCAGAGCAAAAGTAGGCATACTTTGCATAATGTCTGATAGAGACCTTTATGACAAGTTGCAACTTACAAGTCTTGAAATTCCACGTAGGAATGTGGCAACTTTAC HHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHGHGHHHHHHIIIHHFHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGHGGGGGGGGGGFFFFFFFCCAAA NM:i:2  MD:Z:0G107T42   MC:Z:151M   AS:i:145    XS:i:0
## SRR15168839.5    99  MN908947.3  28286   60  151M    =   28285   150 GGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACCA BCCCCCCCCFCFGGGGGGGGGGHHHHGGGGGGGHHGHHHGGGFGHHHGHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHGHHHGGGGGHHHGGGGGGGGGHHHHHHHGHHGGGGGGGGHGGHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHG NM:i:1  MD:Z:150G0  MC:Z:151M   AS:i:150    XS:i:0
## SRR15168839.5    147 MN908947.3  28285   60  151M    =   28286   -150    TGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACC GHGGGGGHHHHHGGGGHHHHHHGFGGGGGHHGHHHGHHHHHHHHHGGGHHHHHHHHGHHHHHHHHHGHHHHHHHHHHHGGGGGHHHGGGGGGGGGHHGHHHHGHGGGGGGGGGHHHHHHHHHHGHHHHHGGGGGGGGGGFFDFCFFACBCA NM:i:0  MD:Z:151    MC:Z:151M   AS:i:151    XS:i:0
## samtools view: writing to standard output failed: Broken pipe
## samtools view: error closing standard output: -1

Calculate read alignment statistics:

bash
samtools flagstat tutorial/bwa/SRR15168839.sam
## 2006 + 0 in total (QC-passed reads + QC-failed reads)
## 2000 + 0 primary
## 0 + 0 secondary
## 6 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 1991 + 0 mapped (99.25% : N/A)
## 1985 + 0 primary mapped (99.25% : N/A)
## 2000 + 0 paired in sequencing
## 1000 + 0 read1
## 1000 + 0 read2
## 1864 + 0 properly paired (93.20% : N/A)
## 1982 + 0 with itself and mate mapped
## 3 + 0 singletons (0.15% : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)

Before we finish, one thing you might have noticed is the difference in ‘properly paired’ alignment rates between the Bowtie 2 and BWA aligners. Unfortunately, this is not simple to explain. The ‘properly paired’ definition is not standard and each aligner uses different rules to define whether a pair of reads is properly paired or not. To know why these rates are different you will have to look up the Bowtie 2 and BWA definitions. We’ll leave that as a puzzle you to solve, if you’re curious.

4 Exercises

The exercises below are designed to strengthen your knowledge of using Bowtie 2 and BWA to align short reads to a reference genome. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.

Create a directory to store the output files from each exercise:

bash
mkdir exercises
mkdir exercises/ebola
mkdir exercises/hiv
mkdir exercises/dengue

4.1 Ebola

Ebola virus disease (EVD), formerly known as Ebola haemorrhagic fever, is a severe, often fatal illness affecting humans and other primates. The virus is transmitted to people from wild animals (such as fruit bats, porcupines and non-human primates) and then spreads in the human population through direct contact with the blood, secretions, organs or other bodily fluids of infected people, and with surfaces and materials (e.g. bedding, clothing) contaminated with these fluids.
World Health Organization
  1. Use the efetch command from EDirect to download and save the Ebola reference genome (AF086833):
bash
efetch -db nuccore -id AF086833 -format fasta > exercises/ebola/AF086833.fasta
  1. Use the fastq-dump command from the SRA Toolkit to download 10,000 paired-end reads from a sequencing run (SRR1972739) of an Ebola sample:
bash
fastq-dump -X 10000 -O exercises/ebola/ --split-files SRR1972739
## Read 10000 spots for SRR1972739
## Written 10000 spots for SRR1972739
  1. Use the SeqKit package to list the name, GC content, and length of the genome:
bash
seqkit fx2tab -n -g -l exercises/ebola/AF086833.fasta
## AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome   18959   41.07
  1. Index the reference genome using the BWA aligner:
bash
bwa index -p exercises/ebola/AF086833 exercises/ebola/AF086833.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index -p exercises/ebola/AF086833 exercises/ebola/AF086833.fasta
## [main] Real time: 0.005 sec; CPU: 0.022 sec
  1. Align the sequencing reads to the Ebola genome use the BWA aligner:
bash
bwa mem exercises/ebola/AF086833 exercises/ebola/SRR1972739_1.fastq exercises/ebola/SRR1972739_2.fastq > exercises/ebola/SRR1972739.sam
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 20000 sequences (2020000 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (738, 5739, 6, 663)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (74, 114, 180)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 392)
## [M::mem_pestat] mean and std.dev: (131.35, 75.24)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 498)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (174, 220, 276)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 480)
## [M::mem_pestat] mean and std.dev: (227.70, 77.37)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 582)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (77, 122, 175)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 371)
## [M::mem_pestat] mean and std.dev: (131.18, 71.99)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 469)
## [M::mem_process_seqs] Processed 20000 reads in 1.011 CPU sec, 1.011 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/ebola/AF086833 exercises/ebola/SRR1972739_1.fastq exercises/ebola/SRR1972739_2.fastq
## [main] Real time: 1.030 sec; CPU: 1.031 sec
  1. Use the samtools flagstat command to report the alignment statistics:
bash
samtools flagstat exercises/ebola/SRR1972739.sam
## 20740 + 0 in total (QC-passed reads + QC-failed reads)
## 20000 + 0 primary
## 0 + 0 secondary
## 740 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 15279 + 0 mapped (73.67% : N/A)
## 14539 + 0 primary mapped (72.69% : N/A)
## 20000 + 0 paired in sequencing
## 10000 + 0 read1
## 10000 + 0 read2
## 14480 + 0 properly paired (72.40% : N/A)
## 14528 + 0 with itself and mate mapped
## 11 + 0 singletons (0.05% : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)

4.2 HIV

Human immunodeficiency virus (HIV) is an infection that attacks the body’s immune system, specifically the white blood cells called CD4 cells. HIV destroys these CD4 cells, weakening a person’s immunity against opportunistic infections, such as tuberculosis and fungal infections, severe bacterial infections and some cancers.
World Health Organization
  1. Use the efetch command from EDirect to download and save the HIV reference genome (AF033819):
bash
efetch -db nuccore -id AF033819 -format fasta > exercises/hiv/AF033819.fasta
  1. Use the esearch and efetch commands to download and save the run information from a public HIV sequencing project (PRJNA541016):
bash
esearch -db sra -query PRJNA541016 | efetch -format runinfo > exercises/hiv/runinfo.csv
  1. Create a file which contains the first five run accession numbers, each accession number should be on its own line:
bash
cut -d ',' -f 1 exercises/hiv/runinfo.csv | tail -n +2 | head -n 5 > exercises/hiv/runids.txt
  1. Use the fastq-dump command to download 1000 single-end reads from each of the runs listed in the project. Instead of manually creating the command for each run accession number, use a for loop to read each run accession number from the file created in Q3:
bash
while read RUN; do

  fastq-dump -X 1000 -O exercises/hiv ${RUN}

done < exercises/hiv/runids.txt
## Read 1000 spots for SRR9008461
## Written 1000 spots for SRR9008461
## Read 1000 spots for SRR9008494
## Written 1000 spots for SRR9008494
## Read 1000 spots for SRR9008493
## Written 1000 spots for SRR9008493
## Read 1000 spots for SRR9008492
## Written 1000 spots for SRR9008492
## Read 1000 spots for SRR9008491
## Written 1000 spots for SRR9008491
  1. Use the Bowtie 2 aligner to index the HIV genome:
bash
bowtie2-build exercises/hiv/AF033819.fasta exercises/hiv/AF033819
## Settings:
##   Output files: "exercises/hiv/AF033819.*.bt2"
##   Line rate: 6 (line is 64 bytes)
##   Lines per side: 1 (side is 64 bytes)
##   Offset rate: 4 (one in 16)
##   FTable chars: 10
##   Strings: unpacked
##   Max bucket size: default
##   Max bucket size, sqrt multiplier: default
##   Max bucket size, len divisor: 4
##   Difference-cover sample period: 1024
##   Endianness: little
##   Actual local endianness: little
##   Sanity checking: disabled
##   Assertions: disabled
##   Random seed: 0
##   Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
##   exercises/hiv/AF033819.fasta
## Building a SMALL index
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 2295
## Using parameters --bmax 1722 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 1722 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 2, merged 7; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Avg bucket size: 1146.75 (target: 1721)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
##   Reserving size (1722) for bucket 1
##   Calculating Z arrays for bucket 1
##   Entering block accumulator loop for bucket 1:
##   bucket 1: 10%
##   bucket 1: 20%
##   bucket 1: 30%
##   bucket 1: 40%
##   bucket 1: 50%
##   bucket 1: 60%
##   bucket 1: 70%
##   bucket 1: 80%
##   bucket 1: 90%
##   bucket 1: 100%
##   Sorting block of length 170 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 171 for bucket 1
## Getting block 2 of 8
##   Reserving size (1722) for bucket 2
##   Calculating Z arrays for bucket 2
##   Entering block accumulator loop for bucket 2:
##   bucket 2: 10%
##   bucket 2: 20%
##   bucket 2: 30%
##   bucket 2: 40%
##   bucket 2: 50%
##   bucket 2: 60%
##   bucket 2: 70%
##   bucket 2: 80%
##   bucket 2: 90%
##   bucket 2: 100%
##   Sorting block of length 1666 for bucket 2
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1667 for bucket 2
## Getting block 3 of 8
##   Reserving size (1722) for bucket 3
##   Calculating Z arrays for bucket 3
##   Entering block accumulator loop for bucket 3:
##   bucket 3: 10%
##   bucket 3: 20%
##   bucket 3: 30%
##   bucket 3: 40%
##   bucket 3: 50%
##   bucket 3: 60%
##   bucket 3: 70%
##   bucket 3: 80%
##   bucket 3: 90%
##   bucket 3: 100%
##   Sorting block of length 1066 for bucket 3
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1067 for bucket 3
## Getting block 4 of 8
##   Reserving size (1722) for bucket 4
##   Calculating Z arrays for bucket 4
##   Entering block accumulator loop for bucket 4:
##   bucket 4: 10%
##   bucket 4: 20%
##   bucket 4: 30%
##   bucket 4: 40%
##   bucket 4: 50%
##   bucket 4: 60%
##   bucket 4: 70%
##   bucket 4: 80%
##   bucket 4: 90%
##   bucket 4: 100%
##   Sorting block of length 1122 for bucket 4
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1123 for bucket 4
## Getting block 5 of 8
##   Reserving size (1722) for bucket 5
##   Calculating Z arrays for bucket 5
##   Entering block accumulator loop for bucket 5:
##   bucket 5: 10%
##   bucket 5: 20%
##   bucket 5: 30%
##   bucket 5: 40%
##   bucket 5: 50%
##   bucket 5: 60%
##   bucket 5: 70%
##   bucket 5: 80%
##   bucket 5: 90%
##   bucket 5: 100%
##   Sorting block of length 1351 for bucket 5
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1352 for bucket 5
## Getting block 6 of 8
##   Reserving size (1722) for bucket 6
##   Calculating Z arrays for bucket 6
##   Entering block accumulator loop for bucket 6:
##   bucket 6: 10%
##   bucket 6: 20%
##   bucket 6: 30%
##   bucket 6: 40%
##   bucket 6: 50%
##   bucket 6: 60%
##   bucket 6: 70%
##   bucket 6: 80%
##   bucket 6: 90%
##   bucket 6: 100%
##   Sorting block of length 1442 for bucket 6
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1443 for bucket 6
## Getting block 7 of 8
##   Reserving size (1722) for bucket 7
##   Calculating Z arrays for bucket 7
##   Entering block accumulator loop for bucket 7:
##   bucket 7: 10%
##   bucket 7: 20%
##   bucket 7: 30%
##   bucket 7: 40%
##   bucket 7: 50%
##   bucket 7: 60%
##   bucket 7: 70%
##   bucket 7: 80%
##   bucket 7: 90%
##   bucket 7: 100%
##   Sorting block of length 1641 for bucket 7
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1642 for bucket 7
## Getting block 8 of 8
##   Reserving size (1722) for bucket 8
##   Calculating Z arrays for bucket 8
##   Entering block accumulator loop for bucket 8:
##   bucket 8: 10%
##   bucket 8: 20%
##   bucket 8: 30%
##   bucket 8: 40%
##   bucket 8: 50%
##   bucket 8: 60%
##   bucket 8: 70%
##   bucket 8: 80%
##   bucket 8: 90%
##   bucket 8: 100%
##   Sorting block of length 716 for bucket 8
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 717 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 3272
## fchr[G]: 4914
## fchr[T]: 7139
## fchr[$]: 9181
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4197571 bytes to primary EBWT file: exercises/hiv/AF033819.1.bt2.tmp
## Wrote 2300 bytes to secondary EBWT file: exercises/hiv/AF033819.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 9181
##     bwtLen: 9182
##     sz: 2296
##     bwtSz: 2296
##     lineRate: 6
##     offRate: 4
##     offMask: 0xfffffff0
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 574
##     offsSz: 2296
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 48
##     sideBwtLen: 192
##     numSides: 48
##     numLines: 48
##     ebwtTotLen: 3072
##     ebwtTotSz: 3072
##     color: 0
##     reverse: 0
## Total time for call to driver() for forward index: 00:00:00
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
##   Time to reverse reference sequence: 00:00:00
## bmax according to bmaxDivN setting: 2295
## Using parameters --bmax 1722 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 1722 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 6; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Avg bucket size: 1146.75 (target: 1721)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
##   Reserving size (1722) for bucket 1
##   Calculating Z arrays for bucket 1
##   Entering block accumulator loop for bucket 1:
##   bucket 1: 10%
##   bucket 1: 20%
##   bucket 1: 30%
##   bucket 1: 40%
##   bucket 1: 50%
##   bucket 1: 60%
##   bucket 1: 70%
##   bucket 1: 80%
##   bucket 1: 90%
##   bucket 1: 100%
##   Sorting block of length 1457 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1458 for bucket 1
## Getting block 2 of 8
##   Reserving size (1722) for bucket 2
##   Calculating Z arrays for bucket 2
##   Entering block accumulator loop for bucket 2:
##   bucket 2: 10%
##   bucket 2: 20%
##   bucket 2: 30%
##   bucket 2: 40%
##   bucket 2: 50%
##   bucket 2: 60%
##   bucket 2: 70%
##   bucket 2: 80%
##   bucket 2: 90%
##   bucket 2: 100%
##   Sorting block of length 282 for bucket 2
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 283 for bucket 2
## Getting block 3 of 8
##   Reserving size (1722) for bucket 3
##   Calculating Z arrays for bucket 3
##   Entering block accumulator loop for bucket 3:
##   bucket 3: 10%
##   bucket 3: 20%
##   bucket 3: 30%
##   bucket 3: 40%
##   bucket 3: 50%
##   bucket 3: 60%
##   bucket 3: 70%
##   bucket 3: 80%
##   bucket 3: 90%
##   bucket 3: 100%
##   Sorting block of length 1451 for bucket 3
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1452 for bucket 3
## Getting block 4 of 8
##   Reserving size (1722) for bucket 4
##   Calculating Z arrays for bucket 4
##   Entering block accumulator loop for bucket 4:
##   bucket 4: 10%
##   bucket 4: 20%
##   bucket 4: 30%
##   bucket 4: 40%
##   bucket 4: 50%
##   bucket 4: 60%
##   bucket 4: 70%
##   bucket 4: 80%
##   bucket 4: 90%
##   bucket 4: 100%
##   Sorting block of length 652 for bucket 4
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 653 for bucket 4
## Getting block 5 of 8
##   Reserving size (1722) for bucket 5
##   Calculating Z arrays for bucket 5
##   Entering block accumulator loop for bucket 5:
##   bucket 5: 10%
##   bucket 5: 20%
##   bucket 5: 30%
##   bucket 5: 40%
##   bucket 5: 50%
##   bucket 5: 60%
##   bucket 5: 70%
##   bucket 5: 80%
##   bucket 5: 90%
##   bucket 5: 100%
##   Sorting block of length 1423 for bucket 5
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1424 for bucket 5
## Getting block 6 of 8
##   Reserving size (1722) for bucket 6
##   Calculating Z arrays for bucket 6
##   Entering block accumulator loop for bucket 6:
##   bucket 6: 10%
##   bucket 6: 20%
##   bucket 6: 30%
##   bucket 6: 40%
##   bucket 6: 50%
##   bucket 6: 60%
##   bucket 6: 70%
##   bucket 6: 80%
##   bucket 6: 90%
##   bucket 6: 100%
##   Sorting block of length 906 for bucket 6
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 907 for bucket 6
## Getting block 7 of 8
##   Reserving size (1722) for bucket 7
##   Calculating Z arrays for bucket 7
##   Entering block accumulator loop for bucket 7:
##   bucket 7: 10%
##   bucket 7: 20%
##   bucket 7: 30%
##   bucket 7: 40%
##   bucket 7: 50%
##   bucket 7: 60%
##   bucket 7: 70%
##   bucket 7: 80%
##   bucket 7: 90%
##   bucket 7: 100%
##   Sorting block of length 1721 for bucket 7
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1722 for bucket 7
## Getting block 8 of 8
##   Reserving size (1722) for bucket 8
##   Calculating Z arrays for bucket 8
##   Entering block accumulator loop for bucket 8:
##   bucket 8: 10%
##   bucket 8: 20%
##   bucket 8: 30%
##   bucket 8: 40%
##   bucket 8: 50%
##   bucket 8: 60%
##   bucket 8: 70%
##   bucket 8: 80%
##   bucket 8: 90%
##   bucket 8: 100%
##   Sorting block of length 1282 for bucket 8
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1283 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 3272
## fchr[G]: 4914
## fchr[T]: 7139
## fchr[$]: 9181
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4197571 bytes to primary EBWT file: exercises/hiv/AF033819.rev.1.bt2.tmp
## Wrote 2300 bytes to secondary EBWT file: exercises/hiv/AF033819.rev.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 9181
##     bwtLen: 9182
##     sz: 2296
##     bwtSz: 2296
##     lineRate: 6
##     offRate: 4
##     offMask: 0xfffffff0
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 574
##     offsSz: 2296
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 48
##     sideBwtLen: 192
##     numSides: 48
##     numLines: 48
##     ebwtTotLen: 3072
##     ebwtTotSz: 3072
##     color: 0
##     reverse: 1
## Total time for backward call to driver() for mirror index: 00:00:00
## Renaming exercises/hiv/AF033819.3.bt2.tmp to exercises/hiv/AF033819.3.bt2
## Renaming exercises/hiv/AF033819.4.bt2.tmp to exercises/hiv/AF033819.4.bt2
## Renaming exercises/hiv/AF033819.1.bt2.tmp to exercises/hiv/AF033819.1.bt2
## Renaming exercises/hiv/AF033819.2.bt2.tmp to exercises/hiv/AF033819.2.bt2
## Renaming exercises/hiv/AF033819.rev.1.bt2.tmp to exercises/hiv/AF033819.rev.1.bt2
## Renaming exercises/hiv/AF033819.rev.2.bt2.tmp to exercises/hiv/AF033819.rev.2.bt2
  1. Align the sequencing reads to the HIV genome use the Bowtie 2 aligner. Again, use a for loop to construct and run each of the commands:
bash
while read RUN; do

  bowtie2 --very-sensitive -x exercises/hiv/AF033819 -U exercises/hiv/${RUN}.fastq > exercises/hiv/${RUN}.sam

done < exercises/hiv/runids.txt
## 1000 reads; of these:
##   1000 (100.00%) were unpaired; of these:
##     740 (74.00%) aligned 0 times
##     260 (26.00%) aligned exactly 1 time
##     0 (0.00%) aligned >1 times
## 26.00% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were unpaired; of these:
##     162 (16.20%) aligned 0 times
##     838 (83.80%) aligned exactly 1 time
##     0 (0.00%) aligned >1 times
## 83.80% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were unpaired; of these:
##     52 (5.20%) aligned 0 times
##     948 (94.80%) aligned exactly 1 time
##     0 (0.00%) aligned >1 times
## 94.80% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were unpaired; of these:
##     355 (35.50%) aligned 0 times
##     645 (64.50%) aligned exactly 1 time
##     0 (0.00%) aligned >1 times
## 64.50% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were unpaired; of these:
##     234 (23.40%) aligned 0 times
##     765 (76.50%) aligned exactly 1 time
##     1 (0.10%) aligned >1 times
## 76.60% overall alignment rate
  1. Calculate the alignment statistics for each of the runs. Use a for loop as before:
bash
while read RUN; do

  samtools flagstat exercises/hiv/${RUN}.sam > exercises/hiv/${RUN}_flagstat.txt

done < exercises/hiv/runids.txt
  1. Use grep to find and report the number of mapped reads for each run:
bash
# limit to first match (-m 1)
grep -m 1 "mapped" exercises/hiv/*_flagstat.txt
## exercises/hiv/SRR9008461_flagstat.txt:260 + 0 mapped (26.00% : N/A)
## exercises/hiv/SRR9008491_flagstat.txt:766 + 0 mapped (76.60% : N/A)
## exercises/hiv/SRR9008492_flagstat.txt:645 + 0 mapped (64.50% : N/A)
## exercises/hiv/SRR9008493_flagstat.txt:948 + 0 mapped (94.80% : N/A)
## exercises/hiv/SRR9008494_flagstat.txt:838 + 0 mapped (83.80% : N/A)

4.3 Dengue

Dengue is a mosquito-borne viral infection that is common in warm, tropical climates. Infection is caused by any one of four closely related dengue viruses (called serotypes) and these can lead to a wide spectrum of symptoms, including some which are extremely mild (unnoticeable) to those that may require medical intervention and hospitalization. In severe cases, fatalities can occur. There is no treatment for the infection itself but the symptoms that a patient experiences can be managed.
World Health Organization
  1. Use the efetch command from EDirect to download and save the Dengue reference genome (NC_001477):
bash
efetch -db nuccore -id NC_001477 -format fasta > exercises/dengue/NC_001477.fasta
  1. Use the esearch and efetch commands to download and save the run information from a public Dengue sequencing project (PRJNA494391):
bash
esearch -db sra -query PRJNA494391 | efetch -format runinfo > exercises/dengue/runinfo.csv
  1. Create a file which contains the run accession numbers of those runs which have more than 1,500,000 reads. Column 4 of the run information file contains the number of reads for each run:
bash
tail -n +2 exercises/dengue/runinfo.csv | grep "Dengue virus" | awk -F "," '($4 > 1500000)' | cut -d "," -f 1 > exercises/dengue/runids.txt
  1. Use the fastq-dump command to download 100,000 paired-end reads from each of the runs listed in the project. Instead of manually creating the command for each run accession number, use a for loop to read each run accession number from the file created in Q3:
bash
while read RUN; do

  fastq-dump -X 100000 -O exercises/dengue --split-files ${RUN}

done < exercises/dengue/runids.txt
## Read 100000 spots for SRR8069228
## Written 100000 spots for SRR8069228
## Read 100000 spots for SRR8069229
## Written 100000 spots for SRR8069229
  1. Use the BWA aligner to index the Dengue genome:
bash
bwa index -p exercises/dengue/NC_001477 exercises/dengue/NC_001477.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index -p exercises/dengue/NC_001477 exercises/dengue/NC_001477.fasta
## [main] Real time: 0.004 sec; CPU: 0.021 sec
  1. Use the Bowtie 2 aligner to index the Dengue genome:
bash
bowtie2-build exercises/dengue/NC_001477.fasta exercises/dengue/NC_001477
## Settings:
##   Output files: "exercises/dengue/NC_001477.*.bt2"
##   Line rate: 6 (line is 64 bytes)
##   Lines per side: 1 (side is 64 bytes)
##   Offset rate: 4 (one in 16)
##   FTable chars: 10
##   Strings: unpacked
##   Max bucket size: default
##   Max bucket size, sqrt multiplier: default
##   Max bucket size, len divisor: 4
##   Difference-cover sample period: 1024
##   Endianness: little
##   Actual local endianness: little
##   Sanity checking: disabled
##   Assertions: disabled
##   Random seed: 0
##   Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
##   exercises/dengue/NC_001477.fasta
## Building a SMALL index
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 2683
## Using parameters --bmax 2013 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 2013 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 2, merged 7; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 2; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Avg bucket size: 1532.71 (target: 2012)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 7
##   Reserving size (2013) for bucket 1
##   Calculating Z arrays for bucket 1
##   Entering block accumulator loop for bucket 1:
##   bucket 1: 10%
##   bucket 1: 20%
##   bucket 1: 30%
##   bucket 1: 40%
##   bucket 1: 50%
##   bucket 1: 60%
##   bucket 1: 70%
##   bucket 1: 80%
##   bucket 1: 90%
##   bucket 1: 100%
##   Sorting block of length 1449 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1450 for bucket 1
## Getting block 2 of 7
##   Reserving size (2013) for bucket 2
##   Calculating Z arrays for bucket 2
##   Entering block accumulator loop for bucket 2:
##   bucket 2: 10%
##   bucket 2: 20%
##   bucket 2: 30%
##   bucket 2: 40%
##   bucket 2: 50%
##   bucket 2: 60%
##   bucket 2: 70%
##   bucket 2: 80%
##   bucket 2: 90%
##   bucket 2: 100%
##   Sorting block of length 1668 for bucket 2
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1669 for bucket 2
## Getting block 3 of 7
##   Reserving size (2013) for bucket 3
##   Calculating Z arrays for bucket 3
##   Entering block accumulator loop for bucket 3:
##   bucket 3: 10%
##   bucket 3: 20%
##   bucket 3: 30%
##   bucket 3: 40%
##   bucket 3: 50%
##   bucket 3: 60%
##   bucket 3: 70%
##   bucket 3: 80%
##   bucket 3: 90%
##   bucket 3: 100%
##   Sorting block of length 1970 for bucket 3
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1971 for bucket 3
## Getting block 4 of 7
##   Reserving size (2013) for bucket 4
##   Calculating Z arrays for bucket 4
##   Entering block accumulator loop for bucket 4:
##   bucket 4: 10%
##   bucket 4: 20%
##   bucket 4: 30%
##   bucket 4: 40%
##   bucket 4: 50%
##   bucket 4: 60%
##   bucket 4: 70%
##   bucket 4: 80%
##   bucket 4: 90%
##   bucket 4: 100%
##   Sorting block of length 2005 for bucket 4
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 2006 for bucket 4
## Getting block 5 of 7
##   Reserving size (2013) for bucket 5
##   Calculating Z arrays for bucket 5
##   Entering block accumulator loop for bucket 5:
##   bucket 5: 10%
##   bucket 5: 20%
##   bucket 5: 30%
##   bucket 5: 40%
##   bucket 5: 50%
##   bucket 5: 60%
##   bucket 5: 70%
##   bucket 5: 80%
##   bucket 5: 90%
##   bucket 5: 100%
##   Sorting block of length 1766 for bucket 5
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1767 for bucket 5
## Getting block 6 of 7
##   Reserving size (2013) for bucket 6
##   Calculating Z arrays for bucket 6
##   Entering block accumulator loop for bucket 6:
##   bucket 6: 10%
##   bucket 6: 20%
##   bucket 6: 30%
##   bucket 6: 40%
##   bucket 6: 50%
##   bucket 6: 60%
##   bucket 6: 70%
##   bucket 6: 80%
##   bucket 6: 90%
##   bucket 6: 100%
##   Sorting block of length 1080 for bucket 6
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1081 for bucket 6
## Getting block 7 of 7
##   Reserving size (2013) for bucket 7
##   Calculating Z arrays for bucket 7
##   Entering block accumulator loop for bucket 7:
##   bucket 7: 10%
##   bucket 7: 20%
##   bucket 7: 30%
##   bucket 7: 40%
##   bucket 7: 50%
##   bucket 7: 60%
##   bucket 7: 70%
##   bucket 7: 80%
##   bucket 7: 90%
##   bucket 7: 100%
##   Sorting block of length 791 for bucket 7
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 792 for bucket 7
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 3426
## fchr[G]: 5666
## fchr[T]: 8436
## fchr[$]: 10735
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4198093 bytes to primary EBWT file: exercises/dengue/NC_001477.1.bt2.tmp
## Wrote 2688 bytes to secondary EBWT file: exercises/dengue/NC_001477.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 10735
##     bwtLen: 10736
##     sz: 2684
##     bwtSz: 2684
##     lineRate: 6
##     offRate: 4
##     offMask: 0xfffffff0
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 671
##     offsSz: 2684
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 48
##     sideBwtLen: 192
##     numSides: 56
##     numLines: 56
##     ebwtTotLen: 3584
##     ebwtTotSz: 3584
##     color: 0
##     reverse: 0
## Total time for call to driver() for forward index: 00:00:00
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
##   Time to reverse reference sequence: 00:00:00
## bmax according to bmaxDivN setting: 2683
## Using parameters --bmax 2013 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 2013 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Avg bucket size: 1532.71 (target: 2012)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 7
##   Reserving size (2013) for bucket 1
##   Calculating Z arrays for bucket 1
##   Entering block accumulator loop for bucket 1:
##   bucket 1: 10%
##   bucket 1: 20%
##   bucket 1: 30%
##   bucket 1: 40%
##   bucket 1: 50%
##   bucket 1: 60%
##   bucket 1: 70%
##   bucket 1: 80%
##   bucket 1: 90%
##   bucket 1: 100%
##   Sorting block of length 1082 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1083 for bucket 1
## Getting block 2 of 7
##   Reserving size (2013) for bucket 2
##   Calculating Z arrays for bucket 2
##   Entering block accumulator loop for bucket 2:
##   bucket 2: 10%
##   bucket 2: 20%
##   bucket 2: 30%
##   bucket 2: 40%
##   bucket 2: 50%
##   bucket 2: 60%
##   bucket 2: 70%
##   bucket 2: 80%
##   bucket 2: 90%
##   bucket 2: 100%
##   Sorting block of length 1956 for bucket 2
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1957 for bucket 2
## Getting block 3 of 7
##   Reserving size (2013) for bucket 3
##   Calculating Z arrays for bucket 3
##   Entering block accumulator loop for bucket 3:
##   bucket 3: 10%
##   bucket 3: 20%
##   bucket 3: 30%
##   bucket 3: 40%
##   bucket 3: 50%
##   bucket 3: 60%
##   bucket 3: 70%
##   bucket 3: 80%
##   bucket 3: 90%
##   bucket 3: 100%
##   Sorting block of length 2009 for bucket 3
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 2010 for bucket 3
## Getting block 4 of 7
##   Reserving size (2013) for bucket 4
##   Calculating Z arrays for bucket 4
##   Entering block accumulator loop for bucket 4:
##   bucket 4: 10%
##   bucket 4: 20%
##   bucket 4: 30%
##   bucket 4: 40%
##   bucket 4: 50%
##   bucket 4: 60%
##   bucket 4: 70%
##   bucket 4: 80%
##   bucket 4: 90%
##   bucket 4: 100%
##   Sorting block of length 1882 for bucket 4
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1883 for bucket 4
## Getting block 5 of 7
##   Reserving size (2013) for bucket 5
##   Calculating Z arrays for bucket 5
##   Entering block accumulator loop for bucket 5:
##   bucket 5: 10%
##   bucket 5: 20%
##   bucket 5: 30%
##   bucket 5: 40%
##   bucket 5: 50%
##   bucket 5: 60%
##   bucket 5: 70%
##   bucket 5: 80%
##   bucket 5: 90%
##   bucket 5: 100%
##   Sorting block of length 1056 for bucket 5
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1057 for bucket 5
## Getting block 6 of 7
##   Reserving size (2013) for bucket 6
##   Calculating Z arrays for bucket 6
##   Entering block accumulator loop for bucket 6:
##   bucket 6: 10%
##   bucket 6: 20%
##   bucket 6: 30%
##   bucket 6: 40%
##   bucket 6: 50%
##   bucket 6: 60%
##   bucket 6: 70%
##   bucket 6: 80%
##   bucket 6: 90%
##   bucket 6: 100%
##   Sorting block of length 1079 for bucket 6
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1080 for bucket 6
## Getting block 7 of 7
##   Reserving size (2013) for bucket 7
##   Calculating Z arrays for bucket 7
##   Entering block accumulator loop for bucket 7:
##   bucket 7: 10%
##   bucket 7: 20%
##   bucket 7: 30%
##   bucket 7: 40%
##   bucket 7: 50%
##   bucket 7: 60%
##   bucket 7: 70%
##   bucket 7: 80%
##   bucket 7: 90%
##   bucket 7: 100%
##   Sorting block of length 1665 for bucket 7
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 1666 for bucket 7
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 3426
## fchr[G]: 5666
## fchr[T]: 8436
## fchr[$]: 10735
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4198093 bytes to primary EBWT file: exercises/dengue/NC_001477.rev.1.bt2.tmp
## Wrote 2688 bytes to secondary EBWT file: exercises/dengue/NC_001477.rev.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 10735
##     bwtLen: 10736
##     sz: 2684
##     bwtSz: 2684
##     lineRate: 6
##     offRate: 4
##     offMask: 0xfffffff0
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 671
##     offsSz: 2684
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 48
##     sideBwtLen: 192
##     numSides: 56
##     numLines: 56
##     ebwtTotLen: 3584
##     ebwtTotSz: 3584
##     color: 0
##     reverse: 1
## Total time for backward call to driver() for mirror index: 00:00:00
## Renaming exercises/dengue/NC_001477.3.bt2.tmp to exercises/dengue/NC_001477.3.bt2
## Renaming exercises/dengue/NC_001477.4.bt2.tmp to exercises/dengue/NC_001477.4.bt2
## Renaming exercises/dengue/NC_001477.1.bt2.tmp to exercises/dengue/NC_001477.1.bt2
## Renaming exercises/dengue/NC_001477.2.bt2.tmp to exercises/dengue/NC_001477.2.bt2
## Renaming exercises/dengue/NC_001477.rev.1.bt2.tmp to exercises/dengue/NC_001477.rev.1.bt2
## Renaming exercises/dengue/NC_001477.rev.2.bt2.tmp to exercises/dengue/NC_001477.rev.2.bt2
  1. Align the sequencing reads to the Dengue genome use both the Bowtie 2 and BWA aligners. Again, use a for loop to construct and run each of the commands:
bash
while read RUN; do

  bwa mem exercises/dengue/NC_001477 exercises/dengue/${RUN}_1.fastq exercises/dengue/${RUN}_2.fastq > exercises/dengue/${RUN}_bwa.sam

  bowtie2 -x exercises/dengue/NC_001477 -1 exercises/dengue/${RUN}_1.fastq -2 exercises/dengue/${RUN}_2.fastq > exercises/dengue/${RUN}_bowtie2.sam

done < exercises/dengue/runids.txt
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 68646 sequences (10000256 bp)...
## [M::process] read 68728 sequences (10000112 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (7649, 8664, 2, 7447)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (20, 42, 70)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 170)
## [M::mem_pestat] mean and std.dev: (46.94, 31.55)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 220)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (217, 265, 306)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (39, 484)
## [M::mem_pestat] mean and std.dev: (261.49, 69.92)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 573)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (19, 40, 66)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 160)
## [M::mem_pestat] mean and std.dev: (44.56, 30.72)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 207)
## [M::mem_process_seqs] Processed 68646 reads in 7.466 CPU sec, 7.433 real sec
## [M::process] read 62626 sequences (9120396 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (7560, 8655, 2, 7568)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (20, 42, 70)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 170)
## [M::mem_pestat] mean and std.dev: (46.71, 31.48)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 220)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (215, 265, 306)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (33, 488)
## [M::mem_pestat] mean and std.dev: (261.58, 71.85)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 579)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (18, 39, 66)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 162)
## [M::mem_pestat] mean and std.dev: (44.19, 30.65)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 210)
## [M::mem_process_seqs] Processed 68728 reads in 7.567 CPU sec, 7.505 real sec
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (6874, 8110, 3, 6750)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (20, 42, 69)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 167)
## [M::mem_pestat] mean and std.dev: (46.73, 31.55)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 216)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (211, 263, 305)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (23, 493)
## [M::mem_pestat] mean and std.dev: (259.36, 72.29)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 587)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (17, 39, 65)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 161)
## [M::mem_pestat] mean and std.dev: (43.53, 30.64)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 209)
## [M::mem_process_seqs] Processed 62626 reads in 6.920 CPU sec, 6.898 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/dengue/NC_001477 exercises/dengue/SRR8069228_1.fastq exercises/dengue/SRR8069228_2.fastq
## [main] Real time: 21.909 sec; CPU: 22.027 sec
## 100000 reads; of these:
##   100000 (100.00%) were paired; of these:
##     76891 (76.89%) aligned concordantly 0 times
##     23109 (23.11%) aligned concordantly exactly 1 time
##     0 (0.00%) aligned concordantly >1 times
##     ----
##     76891 pairs aligned concordantly 0 times; of these:
##       38413 (49.96%) aligned discordantly 1 time
##     ----
##     38478 pairs aligned 0 times concordantly or discordantly; of these:
##       76956 mates make up the pairs; of these:
##         56759 (73.76%) aligned 0 times
##         20197 (26.24%) aligned exactly 1 time
##         0 (0.00%) aligned >1 times
## 71.62% overall alignment rate
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 68464 sequences (10000143 bp)...
## [M::process] read 68462 sequences (10000296 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (7349, 8761, 7, 7228)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (20, 40, 69)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 167)
## [M::mem_pestat] mean and std.dev: (45.96, 31.57)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 216)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (217, 269, 309)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (33, 493)
## [M::mem_pestat] mean and std.dev: (264.46, 72.46)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 585)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (18, 39, 66)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 162)
## [M::mem_pestat] mean and std.dev: (44.24, 30.99)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 210)
## [M::mem_process_seqs] Processed 68464 reads in 7.433 CPU sec, 7.400 real sec
## [M::process] read 63074 sequences (9221933 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (7477, 8691, 2, 7439)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (19, 41, 69)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 169)
## [M::mem_pestat] mean and std.dev: (46.10, 31.59)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 219)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (218, 269, 309)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (36, 491)
## [M::mem_pestat] mean and std.dev: (265.25, 71.67)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 582)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (18, 39, 65)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 159)
## [M::mem_pestat] mean and std.dev: (44.12, 30.62)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 206)
## [M::mem_process_seqs] Processed 68462 reads in 7.474 CPU sec, 7.410 real sec
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (6753, 8274, 1, 6821)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (19, 41, 69)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 169)
## [M::mem_pestat] mean and std.dev: (46.17, 32.14)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 219)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (211, 264, 304)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (25, 490)
## [M::mem_pestat] mean and std.dev: (259.47, 72.16)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 583)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (17, 38, 65)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 161)
## [M::mem_pestat] mean and std.dev: (43.31, 30.89)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 209)
## [M::mem_process_seqs] Processed 63074 reads in 6.931 CPU sec, 6.911 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/dengue/NC_001477 exercises/dengue/SRR8069229_1.fastq exercises/dengue/SRR8069229_2.fastq
## [main] Real time: 21.806 sec; CPU: 21.919 sec
## 100000 reads; of these:
##   100000 (100.00%) were paired; of these:
##     76672 (76.67%) aligned concordantly 0 times
##     23328 (23.33%) aligned concordantly exactly 1 time
##     0 (0.00%) aligned concordantly >1 times
##     ----
##     76672 pairs aligned concordantly 0 times; of these:
##       37483 (48.89%) aligned discordantly 1 time
##     ----
##     39189 pairs aligned 0 times concordantly or discordantly; of these:
##       78378 mates make up the pairs; of these:
##         59066 (75.36%) aligned 0 times
##         19312 (24.64%) aligned exactly 1 time
##         0 (0.00%) aligned >1 times
## 70.47% overall alignment rate
  1. Calculate the alignment statistics for each of the runs using both aligners. Use a for loop as before:
bash
while read RUN; do

  samtools flagstat exercises/dengue/${RUN}_bwa.sam > exercises/dengue/${RUN}_bwa_flagstat.txt

  samtools flagstat exercises/dengue/${RUN}_bowtie2.sam > exercises/dengue/${RUN}_bowtie2_flagstat.txt

done < exercises/dengue/runids.txt
  1. Use grep to find and report the number of mapped reads for each run using the BWA aligner:
bash
# limit to first match (-m 1)
grep -m 1 "mapped" exercises/dengue/*_bwa_flagstat.txt
## exercises/dengue/SRR8069228_bwa_flagstat.txt:178963 + 0 mapped (89.33% : N/A)
## exercises/dengue/SRR8069229_bwa_flagstat.txt:176126 + 0 mapped (87.93% : N/A)
  1. Use grep to find and report the number of mapped reads for each run using the Bowtie 2 aligner:
bash
# limit to first match (-m 1)
grep -m 1 "mapped" exercises/dengue/*_bowtie2_flagstat.txt
## exercises/dengue/SRR8069228_bowtie2_flagstat.txt:143241 + 0 mapped (71.62% : N/A)
## exercises/dengue/SRR8069229_bowtie2_flagstat.txt:140934 + 0 mapped (70.47% : N/A)